:orphan:

Core Basics 3: Train a Classifier on a Snowflake Multi-Table Dataset
====================================================================

In this notebook, we learn how to train a classifier on a more complex
multi-table dataset in which a secondary table is itself the parent of
another table (i.e. a snowflake schema). If you are not familiar with
Khiops, we highly recommend going through the *Basics 1* and *Basics 2*
lessons first.

Make sure you have installed Khiops and Khiops Visualization.

We start by importing Khiops, checking its installation and defining
some helper functions:

.. code:: ipython3

    import os
    import platform
    import subprocess
    from khiops import core as kh

    # Define helper functions
    def peek(file_path, n=10):
        """Shows the first n lines of a file"""
        with open(file_path, encoding="utf8", errors="replace") as file:
            for line in file.readlines()[:n]:
                print(line, end="")
        print("")

    # If there are any issues you may print the Khiops status with the following command
    # kh.get_runner().print_status()

Training a Multi-Table Classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We’ll train a multi-table classifier on an extension of the
``AccidentsSummary`` dataset that we used in the previous notebook
*Core Basics 2*. The ``Accidents`` dataset contains two additional
tables, ``Place`` and ``User``, and is organized in the following
relational snowflake schema:

::

   Accident
   |
   | -- 1:n -- Vehicle
   |             |
   |             |-- 1:n -- User
   |
   | -- 1:1 -- Place

Note that the target variable is ``Gravity``.

To train a classifier for this setup, this schema must be encoded in
the dictionary file. Let’s check the contents of the ``Accidents``
dictionary file:

.. code:: ipython3

    accidents_dataset_dir = os.path.join(kh.get_samples_dir(), "Accidents")
    accidents_kdic = os.path.join(accidents_dataset_dir, "Accidents.kdic")

    print(f"Accidents dictionary file location: {accidents_kdic}")
    print("")
    peek(accidents_kdic, n=45)

.. parsed-literal::

    Accidents dictionary file location: /github/home/khiops_data/samples/Accidents/Accidents.kdic

    Root Dictionary Accident(AccidentId)
    {
    	Categorical AccidentId;
    	Categorical Gravity;
    	Date Date;
    	Time Hour;
    	Categorical Light;
    	Categorical Department;
    	Categorical Commune;
    	Categorical InAgglomeration;
    	Categorical IntersectionType;
    	Categorical Weather;
    	Categorical CollisionType;
    	Categorical PostalAddress;
    	Categorical GPSCode;
    	Numerical Latitude;
    	Numerical Longitude;
    	Entity(Place) Place;
    	Table(Vehicle) Vehicles;
    };

    Dictionary Place(AccidentId)
    {
    	Categorical AccidentId;
    	Categorical RoadType;
    	Categorical RoadNumber;
    	Categorical RoadSecNumber;
    	Categorical RoadLetter;
    	Categorical Circulation;
    	Numerical LaneNumber;
    	Categorical SpecialLane;
    	Categorical Slope;
    	Categorical RoadMarkerId;
    	Numerical RoadMarkerDistance;
    	Categorical Layout;
    	Numerical StripWidth;
    	Numerical LaneWidth;
    	Categorical SurfaceCondition;
    	Categorical Infrastructure;
    	Categorical Localization;
    	Categorical SchoolNear;
    };

    Dictionary Vehicle(AccidentId, VehicleId)

Note the following differences with respect to the dictionary of the
``AccidentsSummary`` dataset:

- The schema for the main table contains one extra special variable,
  defined with the statement ``Entity(Place) Place``, which indicates a
  ``1:1`` relationship between the ``Accident`` and ``Place`` tables.
- The main table ``Accident`` and the entity ``Place`` have the same key
  ``AccidentId``. The table ``Vehicle`` and its child table ``User``
  have two keys, ``AccidentId`` and ``VehicleId``.

Now let’s store the locations of the tables and peek at their contents:

.. code:: ipython3

    accidents_data_file = os.path.join(accidents_dataset_dir, "Accidents.txt")
    print(f"Accidents data table: {accidents_data_file}")
    print("")
    peek(accidents_data_file)

    vehicles_data_file = os.path.join(accidents_dataset_dir, "Vehicles.txt")
    print(f"Vehicles data table: {vehicles_data_file}")
    print("")
    peek(vehicles_data_file)

    places_data_file = os.path.join(accidents_dataset_dir, "Places.txt")
    print(f"Places data table: {places_data_file}")
    print("")
    peek(places_data_file)

    users_data_file = os.path.join(accidents_dataset_dir, "Users.txt")
    print(f"Users data table: {users_data_file}")
    print("")
    peek(users_data_file)

.. parsed-literal::

    Accidents data table: /github/home/khiops_data/samples/Accidents/Accidents.txt

    AccidentId Gravity Date Hour Light Department Commune InAgglomeration IntersectionType Weather CollisionType PostalAddress GPSCode Latitude Longitude
    201800000001 NonLethal 2018-01-24 15:05:00 Daylight 590 005 No Y-type Normal 2Vehicles-BehindVehicles-Frontal route des Ansereuilles M 50.55737 2.55737
    201800000002 NonLethal 2018-02-12 10:15:00 Daylight 590 011 Yes Square VeryGood NoCollision Place du général de Gaul M 50.52936 2.52936
    201800000003 NonLethal 2018-03-04 11:35:00 Daylight 590 477 Yes T-type Normal NoCollision Rue nationale M 50.51243 2.51243
    201800000004 NonLethal 2018-05-05 17:35:00 Daylight 590 052 Yes NoIntersection VeryGood 2Vehicles-Side 30 rue Jules Guesde M 50.51974 2.51974
    201800000005 NonLethal 2018-06-26 16:05:00 Daylight 590 477 Yes NoIntersection Normal 2Vehicles-Side 72 rue Victor Hugo M 50.51607 2.51607
    201800000006 NonLethal 2018-09-23 06:30:00 TwilightOrDawn 590 052 Yes NoIntersection LightRain Other D39 M 50.52132 2.52132
    201800000007 NonLethal 2018-09-26 00:40:00 NightStreelightsOn 590 133 Yes NoIntersection Normal Other 4 route de camphin M 50.52211 2.52211
    201800000008 Lethal 2018-11-30 17:15:00 NightStreelightsOn 590 011 Yes NoIntersection Normal Other rue saint exupéry M 50.53146 2.53146
    201800000009 NonLethal 2018-02-18 15:57:00 Daylight 590 550 No NoIntersection Normal Other rue de l'égalité M 50.53707 2.53707

    Vehicles data table: /github/home/khiops_data/samples/Accidents/Vehicles.txt

    AccidentId VehicleId Direction Category PassengerNumber FixedObstacle MobileObstacle ImpactPoint Maneuver
    201800000001 A01 Unknown Car<=3.5T 0 None Vehicle RightFront TurnToLeft
    201800000001 B01 Unknown Car<=3.5T 0 None Vehicle LeftFront NoDirectionChange
    201800000002 A01 Unknown Car<=3.5T 0 None Pedestrian None NoDirectionChange
    201800000003 A01 Unknown Motorbike>125cm3 0 StationaryVehicle Vehicle Front NoDirectionChange
    201800000003 B01 Unknown Car<=3.5T 0 None Vehicle LeftSide TurnToLeft
    201800000003 C01 Unknown Car<=3.5T 0 None None RightSide Parked
    201800000004 A01 Unknown Car<=3.5T 0 None Other RightFront Avoidance
    201800000004 B01 Unknown Bicycle 0 None Vehicle LeftSide None
    201800000005 A01 Unknown Moped 0 None Vehicle RightFront PassLeft

    Places data table: /github/home/khiops_data/samples/Accidents/Places.txt

    AccidentId RoadType RoadNumber RoadSecNumber RoadLetter Circulation LaneNumber SpecialLane Slope RoadMarkerId RoadMarkerDistance Layout StripWidth LaneWidth SurfaceCondition Infrastructure Localization SchoolNear
    201800000001 Departamental 41 C TwoWay 2 0 Flat RightCurve Normal Unknown Lane 00
    201800000002 Communal 41 D TwoWay 2 0 Flat LeftCurve Normal Unknown Lane 00
    201800000003 Departamental 39 D TwoWay 2 0 Flat Straight Normal Unknown Lane 00
    201800000004 Departamental 39 TwoWay 2 0 Flat Straight Normal Unknown Lane 00
    201800000005 Communal OneWay 1 0 Flat Straight Normal Unknown Lane 00
    201800000006 Departamental 39 D Unknown 2 0 Uphill LeftCurve Wet Unknown Shoulder 00
    201800000007 Departamental 41 D TwoWay 2 0 Flat 16 500 Straight Normal Unknown Shoulder 00
    201800000008 Communal - TwoWay 2 0 Flat Straight Normal Unknown Lane 00
    201800000009 Departamental 141 D TwoWay 2 0 Flat Straight Normal Unknown Shoulder 00

    Users data table: /github/home/khiops_data/samples/Accidents/Users.txt

    AccidentId VehicleId Seat Category Gender TripReason SafetyDevice SafetyDeviceUsed PedestrianLocation PedestrianAction PedestrianCompany BirthYear
    201800000001 A01 1 Driver Male Leisure SeatBelt Yes None None Unknown 1960
    201800000001 B01 1 Driver Male None SeatBelt Yes None None Unknown 1928
    201800000002 A01 1 Driver Male None SeatBelt Yes None None Unknown 1947
    201800000002 A01 Pedestrian Male None Helmet OnLane<=OnSidewalk0mCrossing Crossing Alone 1959
    201800000003 A01 1 Driver Male Leisure Helmet Yes None None Unknown 1987
    201800000003 C01 1 Driver Male None ChildrenDevice None None Unknown 1977
    201800000004 A01 1 Driver Male Leisure SeatBelt Yes None None Unknown 1982
    201800000004 B01 1 Driver Male Leisure Helmet None None Unknown 2013
    201800000005 A01 1 Driver Male Leisure Helmet Yes None None Unknown 2001

Train a classifier for the ``Accidents`` database with 1000 variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The call to ``train_predictor`` is the same as in the exercise of the
previous notebook *Core Basics 2*. The only difference is that the
``additional_data_tables`` dictionary, which contains the paths of the
additional tables, is extended with two new data paths:

- The path of the entity ``Place`` is :literal:`Accident\`Place`.
- The path of the table ``User`` is :literal:`Accident\`Vehicles\`Users`.

As before, we’ll ask Khiops to create 1000 additional features with its
multi-table AutoML mode.

Do not forget:

- The target variable is ``Gravity``
- Set ``max_trees=0``

With these considerations, let’s now train the classifier:

.. code:: ipython3

    accidents_results_dir = os.path.join("exercises", "Accidents")
    accidents_report, accidents_model_kdic = kh.train_predictor(
        accidents_kdic,
        dictionary_name="Accident",
        data_table_path=accidents_data_file,
        target_variable="Gravity",
        results_dir=accidents_results_dir,
        additional_data_tables={
            "Accident`Vehicles": vehicles_data_file,
            "Accident`Place": places_data_file,
            "Accident`Vehicles`Users": users_data_file,
        },
        max_constructed_variables=1000,
        max_trees=0,
    )
    print(f"Accidents report file: {accidents_report}")
    print(f"Accidents modeling dictionary file: {accidents_model_kdic}")

.. parsed-literal::

    Accidents report file: exercises/Accidents/AllReports.khj
    Accidents modeling dictionary file: exercises/Accidents/Modeling.kdic

Take a look at the report
^^^^^^^^^^^^^^^^^^^^^^^^^

Which variables best predict the gravity of an accident?

.. code:: ipython3

    # To visualize uncomment the line below
    # kh.visualize_report(accidents_report)
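
If the visualization tool is not at hand, the question above can also be explored directly from the report file. The following is a minimal sketch, assuming that the ``.khj`` report is a JSON document whose ``preparationReport`` section holds a ``variablesStatistics`` list with ``name`` and ``level`` fields. These key names, as well as the helpers ``load_report`` and ``top_variables_by_level``, are assumptions made for illustration and are not part of the Khiops API.

```python
import json


def load_report(report_path):
    """Load a JSON report file into a Python dictionary."""
    with open(report_path, encoding="utf8", errors="replace") as report_file:
        return json.load(report_file)


def top_variables_by_level(report, n=10):
    """Return the n (name, level) pairs with the highest level.

    Assumes the (hypothetical) layout
    report["preparationReport"]["variablesStatistics"] = [{"name": ..., "level": ...}, ...]
    Variables without a "level" field are treated as having level 0.
    """
    stats = report.get("preparationReport", {}).get("variablesStatistics", [])
    ranked = sorted(stats, key=lambda var: var.get("level", 0.0), reverse=True)
    return [(var["name"], var.get("level", 0.0)) for var in ranked[:n]]


# Usage, once the training cell above has produced ``accidents_report``:
# for name, level in top_variables_by_level(load_report(accidents_report)):
#     print(f"{level:.4f}  {name}")
```

Sorting by level only gives a first impression of which native or constructed variables carry information about ``Gravity``. If the report layout differs from the assumed one, the function returns an empty list rather than failing.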